Correlates of the country differences in the infection and mortality rates during the first wave of the COVID-19 pandemic: evidence from Bayesian model averaging

The COVID-19 pandemic resulted in great discrepancies in both infection and mortality rates between countries. Besides the biological and epidemiological factors, a multitude of social and economic criteria also influenced the extent to which these discrepancies appeared. Consequently, there is an active debate regarding the critical socio-economic and health factors that correlate with the infection and mortality rates outcome of the pandemic. Here, we leverage Bayesian model averaging techniques and country level data to investigate whether 28 variables, which describe a diverse set of health and socio-economic characteristics, correlate with the final number of infections and deaths during the first wave of the coronavirus pandemic. We show that only a few variables are able to robustly correlate with these outcomes. To understand the relationship between the potential correlates in explaining the infection and death rates, we create a Jointness Space. Using this space, we conclude that the extent to which each variable is able to provide a credible explanation for the COVID-19 infections/mortality outcome varies between countries because of their heterogeneous features.

www.nature.com/scientificreports/ wave of the pandemic can not only aid in inferring whether there are any general rules in their potential impact, but also would offer guidance for future policies that aim at preventing the emergence of future epidemic crises.
To this end, here we perform a detailed statistical analysis on a large set of potential health and socioeconomic variables and explore their potential to explain variation in the observed coronavirus total infections/ deaths between countries in the first wave of the virus' spread. We focus on COVID-19 data that are generated only in the first wave of the pandemic, and thus do not account for various waves (we formally define the first wave in the next section). While this may be seen as a limitation of our analysis, we assert that for each subsequent wave, there was more knowledge about the spread of the virus and vaccines were available. This significantly impacted the way in which the population reacted to the potential susceptibility of the virus. Thus, each wave likely exhibited its own health, social, and economic characteristics, and therefore should be studied separately.
To construct the set of potential correlates we conduct a thorough review of the literature that describes the social and economic factors which contribute to the spread of an epidemic. We identify a total of 28 potential variables that describe a diverse ensemble of factors, including: healthcare infrastructure, societal characteristics, economic performance, demographic structure etc. To investigate the performance of each variable in explaining the coronavirus infections/deaths outcome, we collect a sample of 105 countries, the largest set of countries for which all data were available, and utilize the technique of Bayesian model averaging (BMA). BMA allows us to isolate the most important correlates by calculating the posterior probability that they truly regulate the process. At the same time, BMA provides estimates for the relative impact of the correlates and accounts for the uncertainty in their selection [12][13][14] . In this respect, our analysis adds value to a growing body of literature which applies Bayesian methods for investigating the critical factors that drive a certain process and, in this particular case, the outcome of the COVID-19 pandemic 15 .
Based on the studied data, we observe patterns suggesting that, during the first wave of the pandemic, there were only a few variables that acted as strong and robust correlates with the final number of registered coronavirus infections and deaths in a country. These variables are related to the effect of density in social interactions and the prevalence of overweight individuals within the population. A simple correlation analysis indicates that the heterogeneity between the countries in terms of their health, social and economic nature might be the driver of this conclusion. Thus, the initial BMA results cannot capture (potentially) significant interactions between the correlates that are relevant to a particular country. To deal with this issue, we develop the coronavirus correlates Jointness Space. The Jointness Space models the interrelation between the potential correlates in explaining the coronavirus infections/deaths outcome, and can represent a statistical foundation for understanding the relationships between variables when developing policy recommendations for preventing future epidemic crises. Using this space, we find that the routes for reducing the potential negative impact of COVID-19 should focus on decreasing the prevalence of excess weight and a small number of other variables that are relevant to those studied herein. This will reduce both the registered infections and the observed deaths due to the COVID-19 disease. In the absence of models that adequately cover all relevant aspects of infections and deaths, this study provides information about the socio-economic correlates of the coronavirus pandemic.

Preliminaries
Measuring COVID-19 infections and death rates. In a formal setting, the final number of registered COVID-19 infections per million population (p.m.p.) and the number of total COVID-19 deaths p.m.p. during the first wave of the pandemic are a result of a disease spreading process 16,17 . The extent to which a disease spreads within a population is uniquely determined by its reproduction number. This number describes the expected number of cases directly generated by one case in a population in which all individuals are susceptible to infection 18,19 . Obviously, its magnitude depends on various natural characteristics of the disease, such as its infectivity or the duration of infectiousness, and the social distancing measures imposed by the government. Also, it depends on an abundance of health and socio-economic factors that govern the behavioral interactions within a population 20,21 .
In general, we never observe the reproduction number, but rather the disease outcome, i.e., the number of infections/deaths. Thus, it is mathematically complex and computationally expensive to try and infer the reproduction number. To circumvent this problem, we utilize its known characteristics and derive a much simpler statistical model for the COVID-19 outcome. Here we choose a specific formulation in which the disease outcome is modeled as a dependent variable, through a linear regression framework, as either the log of accumulated number of registered COVID-19 infections p.m.p. or the log of the accumulated number of COVID-19 deaths p.m.p. of the country at the end of the first wave of the pandemic. We focus on registered quantities normalized on per capita basis for the dependent variable instead of raw values to eliminate the bias in the outcomes arising from the different population sizes in the studied countries. The accumulation of the registered infections and deaths spans from the day of observation of the first infection in the country, up until the last day of the first wave of the pandemic in that country. The last day is, in general, different for each country and is inferred on the basis of the level of daily government response. The estimation procedure used to infer the last day of the first wave will be discussed in more detail in the next section.
The log transformation of the COVID-19 infections/deaths p.m.p. reduces the skewness of the original data and makes the dependent variable real-valued and continuous. For such a dependent variable, the linear regression framework is the simplest tool that quantifies the marginal effect of a set of potential independent variables (correlates). Its advantage lies in the efficient and unbiased analytical inference of the strength of the linear relationship. As such it has been widely used in modeling the outcome of epidemiological phenomena (see for example Refs. [22][23][24].
A central question which arises in the model specification is the selection of the independent variables. While a literature review can offer a comprehensive overview of all potential correlates, in reality we are never www.nature.com/scientificreports/ certain in their credibility. To reduce our uncertainty, we resort to BMA. BMA leverages Bayesian statistics to account for model uncertainty by estimating each possible specification, and thus evaluating the posterior distribution of each parameter value and probability that a particular model is the correct one 25 . This has allowed the BMA technique to be used in various domains, ranging from studying correlates of economic growth 26 , up to determinants of innovation processes 27 . Recently, it was even applied for estimating the output losses during the Covid-19 pandemic 28 .
Baseline model. The BMA method relies on the estimation of a baseline model that is used for evaluating the performance of all other models. In our case, this is the model which encompasses only variables for the state of the epidemic dynamics within the country and effect of government policies regarding social distancing, contact tracing and testing procedures. We use two variables to quantify the possibility that countries are in a different state of the disease spreading process. The first variable simply measures the duration of epidemics ( d 1i ) in a country and is defined as the number of days since the first registered infection. In addition, we evaluate the time which the country had to prepare for the first wave of coronavirus ( d 2i ). This is given as the number of days between the first registered infection worldwide and the first infection in the country.
In order to assess the effect of government policies regarding social distancing, testing, and tracking of infections, we construct an aggregated government response index ( s i ). The index quantifies the average daily variation in government responses to the epidemic dynamics. As a measure for the daily variation, we take the Oxford COVID-19 government response index 29 . The Oxford COVID-19 government response index is a composite measure that combines the daily effect of policies on social distancing, testing and contact tracing in an economy. For each country, we construct a weighted average of the index from all available data since their first registered coronavirus infection, up until the end date, i.e., the date when the government response index is at its maximum value. This threshold is chosen as a means to capture the moment when a country gains the ability to control and stabilize the propagation of the disease. To emphasize the effect of policy responses implemented on earlier dates, we construct a weighted average by putting a larger weight on those dates. This is because earlier responses are supposed to have a bigger impact on the prevention of the spread of the virus. The procedure implemented to derive the average government response index is described in Section S1 of the Supplementary Information (SI). Figure 1 visualizes the results from the baseline model. We observe that the countries which had more detailed response policies also had less COVID-19 infections and mortality rates, as expected. In addition, the countries with longer duration of the crisis registered more infections and deaths p.m.p., whereas the countries which had more time to prepare for the crisis also had less infections and deaths.
It is apparent that the baseline model already has a large coefficient of determination ( R 2 ) and can significantly explain a certain amount of the cross country variations in registered COVID infections/deaths p.m.p.. However, there is still a large amount of variation that we conjecture can be attributed to various health, social, and economic correlates present within a society that are unrelated to the effects of the epidemic dynamics and government policy variables.  Table 1. For a detailed description of the potential effect of the correlates we refer to the references given in the same table, and the references therein. We hereby point out that the data for each potential correlate corresponds to the last observed value (the value in 2019). This prevents the possible problem of endogenous independent variables in the specification of the regression.
In what follows, we only describe, briefly, the basic characteristics of the set of potential correlates.
Healthcare infrastructure. The healthcare infrastructure essentially determines both the quantity and quality with which health care services are delivered in a time of an epidemic. As measures, we include 2 variables which capture the quantity of hospital beds, nurses and medical practitioners, as well as the quality of the coverage of essential health services. On the one hand, studies report that well-structured healthcare resources positively affect a country's capacity to deal with epidemic emergencies [30][31][32][33][34][35][36][37] . On the other hand, the healthcare infrastructure also greatly impacts the country's ability to perform testing and reporting when identifying the infected people. In this regard, economies with better structure are able to easily perform mass testing and more detailed reporting [38][39][40] .
National health statistics. The physical and mental state of a person plays an important role in the degree to which the individual is susceptible to disease. In countries where a significant proportion of the population suffer from diseases highly associated with the spread of an infectious disease as well as its fatal outcomes, we would expect more severe consequences of the emergent epidemics [41][42][43][44] . Specifically, metabolic disorders such Table 1. List of potential correlates of the COVID-19 first wave infections and mortality rates. Refs.

Economic performance
Economic development GDP p.c., PPP $ WDI [51][52][53][54][57][58][59] Labor market Employment to population ratio WDI www.nature.com/scientificreports/ as diabetes may intensify epidemic complications 45,46 , whereas it has been observed that the susceptibility to various diseases account for the majority of deaths in complex emergencies 47 . In addition, there is empirical evidence that adequate hygiene greatly reduces the rate of mortality, whereas overweight or asthma prevalence in the population may increase the fatality of epidemic diseases [48][49][50] . To quantify the national health characteristics, we include 6 variables that assess the general health level in the studied countries.
Economic performance. We evaluate the economic performance of a country through 4 variables. This performance often mirrors the country's ability to intervene in a case of a public health crisis [51][52][53][54][55][56] . Variables such as GDP per capita have been used in modeling health outcomes, mortality trends, cause-specific mortality estimation and health system performance and finances [57][58][59] . For poor countries, economic performance appears to improve health by providing the means to meet essential needs such as food, clean water and shelter, as well access to basic health care services. However, after a country reaches a certain threshold of development, few health benefits arise from further economic growth. It has been suggested that this is the reason why, contrary to expectations, the economic downturns during the 20th century were associated with declines in mortality rates 60,61 .
Observations indicate that what drives the health in industrialized countries is not absolute wealth or growth but how the nation's resources are shared across the population 62 . More-egalitarian income distributions are associated with better health of the population [63][64][65][66] .
Societal characteristics. The characteristics of a society often reveal the way in which people interact, and thus spread the disease. In this aspect, properties such as education and the degree of digitalization within a society reflect the level of a person's reaction and promotion of self-induced measures for reducing the spread of the disease [67][68][69][70][71] . Also, the way individuals encounter (mix) each other through their personal networks or chance encounters may influence the spread of infectious diseases 21,[72][73][74][75] . To measure the societal characteristics, we identify 4 variables.
Demographic structure. Similarly, to the national health statistics, the demographic structure may impact the average susceptibility of the population to a disease. Certain demographic groups may simply have weaker defensive health mechanisms to cope with the stress induced by the disease [76][77][78][79] . In addition, the location of living may greatly affect the way in which the disease is spread 80,81 . To account for these phenomena, we collect 7 variables.
Natural environment. Numerous studies discuss possible correlation between air pollution and COVID-19 infections and mortality rates 11,82,83 . In addition, some authors note that countries where natural sustainability is deteriorated, are also more vulnerable to epidemic outbreak 10 . On the other hand, healthy natural environments may attract more tourists, which could drive the disease spread 38 . Finally, weather patterns can also impact the infectiousness of the disease, especially exposure when there are very cold days in winter and very hot days in summer 84 . We gather the data for 5 variables which capture the essence of this characteristic.

Results
BMA estimation. We use this set of variables and estimate two distinct BMA models. In the first model the dependent variable is the log of COVID-19 infections p.m.p., whereas in the second model we investigate the critical correlates of the log of the mortality rate due to the coronavirus. For the estimation procedure we use data on 105 countries. This is the maximal set of countries for which the data on all 28 potential correlates could be attained. The summary statistics and the data gathering and preprocessing procedures are described in SI Section S2. The mathematical background of BMA together with our inference setup is given in SI Section S2. Figure 2 displays the respective results. In both situations, the variables are ordered according to their posterior inclusion probabilities (PIP), given in the second column. PIP quantifies the posterior probability that a given correlate belongs to the linear regression model that best describes the COVID-19 infections/mortality rates. Besides this statistic, we also provide the posterior mean (Post mean) and the posterior standard deviation (Post Std). Post mean is an estimate of the average magnitude of the effect of a correlate, whereas the Post Std evaluates the deviation from this value.
There are multiple ways which can be used to classify the correlates into groups depending on their probability to be included in the model. A standard approach is to divide the correlates on the basis of the difference between their posterior and prior inclusion probabilities 14 . In the inference procedure (described in SI Section S3) we initially assumed that the linear regression model which best describes the COVID-19 first wave infections and mortality rates is a result of the baseline specification and 3 additional variables. Our prior belief stems from the general observation which suggests that economies are heterogeneous, and a small number of complementing factors may contribute to the extent of the coronavirus spread, while the other potential correlates may simply behave as substitutes in terms of the socio-economic interpretation within a country. Altogether, this implies that the prior inclusion probability of each potential correlate is around 0.1. We use this attribute, together with the posterior inclusion probability of each correlate, to divide the correlates into four disjoint groups: Correlates with strong evidence: (PIP > 0.5). The first group describes the correlates which have, by far, larger posterior inclusion probabilities than prior probabilities, and thus there is strong evidence that they should be included in the true model. We find two such variables related to explaining the coronavirus infections: the overweight prevalence in the country and the population density. Both variables are positively related with the number of registered COVID-19 infections p.m.p.. When investigating the critical correlates of the coronavirus www.nature.com/scientificreports/ deaths, it appears that the overweight prevalence is the only variable for which there is strong evidence to explain the outcome and has a positive impact.
Correlates with medium evidence: ( 0.5 ≥ PIP > 0.1). There are no variables for which there is medium evidence to be a correlate of the COVID-19 number of infections in the first wave, whereas mortality from nonnatural causes is the only variable for which there is medium evidence to be a correlate of the COVID-19 death rate, with a negative effect. Correlates with negligible evidence: (PIP ≤ 0.05). All other variables have negligible evidence to be a true correlate of the coronavirus outcome. In total, we find negligible evidence for explaining the coronavirus infections in 23 variables and for explaining the coronavirus deaths in 24 variables. The division of the variables into groups allows us to assess the robustness of each potential correlate-those belonging to a group described with a larger PIP also offer more credible explanation for the coronavirus infections and death rates. Nonetheless, we point out that although the comparison between posterior inclusion probabilities and prior inclusion probabilities is a common approach, its interpretation must be taken with care due to two reasons. First, there are other methods that can be used to divide the correlates into groups which may lead to different interpretation for the credibility of the correlates to explain the coronavirus cases/deaths 90 . Second, the inhomogeneous nature of the specific features of the countries can drive our results. The presence of this phenomenon in our data be inferred by conducting a simple correlation analysis between the potential correlates. If the variables are highly correlated between each other then there is a problem of multicolinearity. Multicolinearity can lead to wider credible intervals that eventually produce less statistically reliable posterior inclusion probabilities in terms of the effect of independent variables in a model. As said in Ref. 26 , even if the posterior inclusion probability is lower than the prior inclusion probability for a given variable, it might be that this particular variable is important to decision makers under certain circumstances.
In SI Section S4 we conduct several checks to confirm the robustness of our results. In the first robustness check we investigate the impact of outliers. There were several countries which were either extremely affected by the coronavirus or displayed great immunity to the epidemic crisis. To check the robustness of our results against the presence of such data we implement the following strategy. First, we remove a country from the sample. Then, we re-perform the BMA procedure with the resulting countries. We repeat this procedure for every country and recover the median results for each potential correlate. The results indicate that the findings presented here are valid even in the presence of outliers. In the same section, we display the economies which contributed most and least to the credibility of a particular variable. These are the countries which, when excluded, lead to the minimum, respectively maximum, posterior inclusion probability of the given variable. The investigation  www.nature.com/scientificreports/ suggests that there are multiple countries which are significant contributors to the PIP value of each correlate, thus further indicating that there is heterogeneity in the health social and economic features of the countries. In the second check, we change the end date of the pandemic to be equal to the first date after the day at which the daily government response index is at its maximum and that is at least 20% lower than the daily maximum. This effectively prolongs the duration of the first wave. Nonetheless, it still does not impact the findings. In the third check, we change the dependent variable to be the raw number of infections and deaths at the end of the first wave. In other words, now the dependent variable describes counts and the linear regression framework is not a suitable model. Instead, for the estimation of the marginal impact we use a quasi-Poisson model, which is the most often used procedure when the dependent variable is given as a count that has a large variance 91 . Even in this case, the results do not change. In the final robustness check, we add a spatial weighting matrix in the baseline model in order to account for the potential spatial autocorrelation in the spread of COVID-19. Multiple studies have indicated that this effect might exist (see for example 92 ). Again our findings do not significantly change.
Definitely, even if useful for presentation purposes, the mechanical application of a threshold, or a simple comparison between the prior and the posterior, should often be avoided in practice. Each BMA analysis should be coupled with an investigation for the interrelationships between the variables in explaining the dependent variable. We perform this analysis in the subsequent section.
"Jointness Space" of the COVID-19 infections/deaths correlates. The next step in deriving the linear regression model that describes best the coronavirus infections/mortality rates is to find its dimension, i.e., the number of explanatory variables included in the model. As a measure for this quantity, BMA provides the posterior size, formally defined as the posterior belief for the dimension of the model. We find that, for the coronavirus infections p.m.p. the posterior model size is 2.21 whereas for the coronavirus deaths p.m.p. it is 1.34.
After discovering the model size, we need to specify the explanatory variables. This raises the issue of how to construct the appropriate model. One possible solution is to use the correlates with the highest PIP value and regress them on the dependent variable. However, this neglects the interdependence of inclusion and exclusion of correlates in a same model. A standard approach for resolving this issue is to conduct a statistical jointness test. The concept of jointness has been introduced within the BMA framework with the aim to capture dependence between explanatory variables in the posterior distribution over the model space 93 . By emphasizing dependence and conditioning on a set of one or more other variables, jointness moves away from marginal measures of variable importance and investigates the sensitivity of posterior distributions of parameters of interest to dependence across regressors. For example, if two variables are complementary in their posterior distribution over the model space, models that either include or exclude both variables together receive relatively more weight than models where only one variable is present. In our context, jointness tests will allow us to infer whether two variables are complements, i.e., tend to be included together in models with high posterior probability, or substitutes, i.e., models with high posterior probability tend to exclude the joint inclusion of both variables.
To better understand the properties of the COVID-19 infection and mortality rates during the first wave, we perform the jointness test developed by Hofmarcher et al. 94 . Using this test we can estimate a metric between each pair of correlates and quantify their relationship in a range between −1 and 1. In the two extremes, −1 indicates that the two correlates behave as perfect substitutes in the true model, whereas 1 indicates that they are included in the true model together. The resulting jointness metric between pairs of correlates can be used to construct a network (graph), which we refer to as the Jointness Space of the COVID-19 correlates. In this network, the nodes are the potential health, social and economic correlates, whereas the jointness values represent the edge weights. In other words, two arbitrary correlates are linked with each other by the posterior belief that both of them belong to the same linear regression model governing the coronavirus infections/mortality rate.
In theory, many possible factors may cause complementarity between the variables, such as national culture 95 , the type of healthcare system 96 or political priorities 97 . All of these are a priori notions of what dimension drives the relatedness between the potential correlates and assume that there is little flexibility in choosing the correct model. Instead, the Jointness Space follows an agnostic approach and uses a data-driven measure, based on the idea that, if two correlates are related because they offer contrasting information regarding the coronavirus outcome they will tend to be included in the true model in tandem, whereas variables that give similar information are less likely to be included together. Hence, the developed network offers a statistical view for the importance of the social, health and economic correlates when developing policies aimed at reducing the impact of epidemic crises.
The networks depicted in Fig. 3 visualize the Jointness Space of the correlates included in our BMA framework. To emphasize the complementary relationships, we connect only correlates with positive jointness. The full description for the procedure implemented for constructing the Jointness Space is given in SI Section S5. In the networks, the correlates which can be included in multiple models take a more central position whereas the periphery is constituted of correlates whose credibility in explaining the coronavirus outcome mostly substitutes the effect of other variables.
Interestingly, we observe that the topological form of the Jointness Space is not significantly determined by how we specify the dependent variable. In both situations, there is one large connected component with correlates where the central role is played by the overweight prevalence. Thus, the obtained maps suggest the first step in the construction of the linear-regression model for the COVID-19 infections/death rate in the first wave is by first focusing on the fraction of overweight persons in the country. Moreover, almost all other variables belong to the same component. Only in the case when the dependent variable is modelled through the COVID-19 deaths, Life expectancy and Health coverage are excluded from the component. Hence, the variables included in our analysis are complements in explaining the COVID-19 infections/death rates. Based on this finding, we once again assert that the next variables that will be included in the model, should be specific for the economy that is www.nature.com/scientificreports/ the subject of the study. Nonetheless, improving the features of the correlates that are located more centrally might yield a synergistic effect, thus significantly reducing the risk of a more negative COVID-19 infections/death rate.

Conclusion and discussion
In this work, we utilized Bayesian model averaging techniques to provide a comprehensive analysis for the health, social and economic correlates of that contributed to between country differences in the final number of infections and deaths during the first wave of the COVID-19 pandemic. Our findings suggest that government response policies, such as testing procedures, tracking of individuals and social distancing measures, and the state of the dynamics of the disease spread can significantly explain the variety in the coronavirus outcome between the countries. Aside from these variables, only a handful of additional variables are able to robustly explain the extent of the COVID-19 infection/deaths and thus provide general rules for the virus spread. The sole variable strongly related to the coronavirus deaths is the overweight prevalence. Countries with a larger fraction of overweight population also show greater susceptibility to fatal virus outcomes. Interestingly, besides the overweight prevalence, the population density is also a strong correlate of the registered coronavirus infections per million population. More densely populated countries display higher infection rates. Plentiful explanations can provide a possible interpretation for these results. For instance, it is known that the degree of disease spread scales proportionally with population density 98 . This is because, everything else considered, in denser populations typically there is more social mixing 21 . In a similar fashion, various explanations can be found for the observed effect of overweight prevalence. In particular, the prevalence of overweight people is closely related to unhealthy habits of living and, hence, larger susceptibility to both disease infections and fatal outcomes.
The robustness checks and the performed jointness analysis suggested that the insignificance of the other variables might not be the reason for their low PIP values. Instead, the variables which we studied have complementary effects in explaining the COVID-19 infections and death rates of the first wave of the pandemic. This led us to suspect that the results are driven by the heterogeneous health, social, and economic features of the countries. To this end, an interesting topic for future research would be to explore how the effect of the correlates evolved during the different waves of the pandemic. In the absence of a unifying framework covering the relevant aspects of the interrelation between the potential correlates during the various waves, the jointness analysis performed here (and the resulting Jointness Space) can provide the starting point for the development of a more comprehensive understanding of the factors determining the infection and mortality rates of the pandemic. Moreover, www.nature.com/scientificreports/ with an improved understanding of the dynamics of the coronavirus pandemic, the insights obtained from this analysis can influence the development of appropriate policy recommendations.

Methods
The methods and data used in this analysis are described in detail in the Supplementary Information document. The data used in the analysis are available at https:// github. com/ pero-jolak/ coron avirus-socio-econo mic-deter minan ts. All experiments were performed in accordance with relevant guidelines and regulations.